Research on Multi-Label Propagation Clustering Method for Microblog Hot Topic Detection
CHEN Yu-Zhong, FANG Ming-Yue, GUO Wen-Zhong
Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350108
College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108
Abstract: With the rapid growth of microblog data, extracting hot topics from vast numbers of microblog posts has become a research hotspot. Because traditional hot term extraction methods are poorly suited to microblog data, a life value calculation model based on aging theory is established to extract hot terms. A hot term co-occurrence network is then built from the correlations between hot terms. Traditional clustering methods can hardly handle hot terms that overlap between different topics, and they cannot process vast amounts of data efficiently. To address these problems, a term clustering method based on a multi-label propagation algorithm (TCMLPA), which has nearly linear time complexity, is proposed to detect hot topics in the hot term co-occurrence network. The experimental results show that the life value calculation model filters noise and extracts hot terms effectively. Meanwhile, TCMLPA ensures the stability of the clustering results and improves the accuracy and efficiency of hot topic detection.
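The abstract does not give the details of TCMLPA, so the following Python fragment is only a minimal sketch of the general idea: hot terms that appear in the same post are linked in a weighted co-occurrence network, and a simplified multi-label propagation step lets one term keep several labels, so that a term may belong to more than one topic. The function names, the parameters `max_labels` and `rounds`, the synchronous update, and the 1/`max_labels` pruning threshold are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_network(posts, hot_terms):
    """Link two hot terms whenever they appear in the same post.

    posts: list of token lists; hot_terms: set of terms (assumed given).
    The edge weight is the number of posts containing both terms.
    """
    weights = defaultdict(int)
    for post in posts:
        present = sorted(set(post) & hot_terms)
        for a, b in combinations(present, 2):
            weights[(a, b)] += 1
    graph = defaultdict(dict)
    for (a, b), w in weights.items():
        graph[a][b] = w
        graph[b][a] = w
    return dict(graph)  # hot terms with no co-occurrence are left out


def propagate_labels(graph, max_labels=2, rounds=10):
    """Simplified multi-label propagation (an assumption, not TCMLPA).

    Each term holds a dict label -> belonging coefficient (summing to 1).
    In every round a term takes the weighted average of its neighbours'
    coefficients, drops labels below 1 / max_labels, and renormalises.
    """
    labels = {node: {node: 1.0} for node in graph}
    threshold = 1.0 / max_labels
    for _ in range(rounds):
        updated = {}
        for node, neighbours in graph.items():
            total = sum(neighbours.values())
            scores = defaultdict(float)
            for other, weight in neighbours.items():
                for label, coeff in labels[other].items():
                    scores[label] += coeff * weight / total
            kept = {l: s for l, s in scores.items() if s >= threshold}
            if not kept:
                # Keep the strongest label; ties are broken by label name.
                best = min(scores, key=lambda l: (-scores[l], l))
                kept = {best: scores[best]}
            norm = sum(kept.values())
            updated[node] = {l: s / norm for l, s in kept.items()}
        labels = updated
    return labels


def topics_from_labels(labels):
    """Group terms by label; a term with two labels lands in two topics."""
    topics = defaultdict(set)
    for term, term_labels in labels.items():
        for label in term_labels:
            topics[label].add(term)
    return dict(topics)
```

Each round visits every edge once, which is the property that gives label propagation methods their nearly linear cost per iteration. A fixed number of synchronous rounds is used here only to keep the sketch deterministic; a practical implementation would need a proper stopping criterion.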